INN Hotels Project

Context

A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons include changes of plans and scheduling conflicts. Cancelling is often made easier by the option to do so free of charge or at a low cost, which benefits hotel guests but is a less desirable and possibly revenue-diminishing factor for hotels. Such losses are particularly high for last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impacts a hotel on various fronts:

Objective

The increasing number of cancellations calls for a machine-learning-based solution that can help predict which bookings are likely to be canceled. INN Hotels Group has a chain of hotels in Portugal. The group is facing a high number of booking cancellations and has reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a model that can predict in advance which bookings will be canceled, and help formulate profitable policies for cancellations and refunds.

Data Description

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

1- Importing necessary libraries and data

2- Data Overview

Observation

Observation

Observation

Observation

Observation

Observation

3- Exploratory Data Analysis (EDA)

Leading Questions:

  1. What are the busiest months in the hotel?
  2. Which market segment do most of the guests come from?
  3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
  4. What percentage of bookings are canceled?
  5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
  6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

3-1 Univariate Analysis

Observation

Observation

Observation

Observation

Observation

Observation

Observation

Observation

Observation

Observation

Observation

Observation

Observation

Observation

Observation

Observation

Observation

Observation

3-2 Bivariate Analysis

Observation

Observation

Observation

Observation

Observation

Observation

Observation

Observation

Observation

4- Data Preprocessing

4-1 Missing value treatment

Observation

4-2 Feature engineering

4-3 Outlier detection and treatment

5- EDA

Observation

6- Data Preparation for modeling

6-1 Splitting data into train and test
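As a minimal sketch of this step, the split can be done with scikit-learn's train_test_split. The column names (feature_a, feature_b, target), the 70/30 ratio, and the use of stratification are assumptions made for illustration, and the data below is synthetic rather than the hotel dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the booking data (hypothetical column names).
rng = np.random.default_rng(1)
data = pd.DataFrame(
    {
        "feature_a": rng.normal(size=100),
        "feature_b": rng.integers(0, 5, size=100),
        "target": np.repeat([0, 1], [70, 30]),  # 1 = canceled (assumed encoding)
    }
)

X = data.drop(columns="target")
y = data["target"]

# stratify=y keeps the share of canceled bookings similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```

Stratifying is a design choice: with an imbalanced target it avoids a test set whose cancellation rate differs noticeably from the training set.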

6-2 Check model parameters and performance

7- Building a Logistic Regression model

7-1 Model evaluation criterion

Model can make wrong predictions as:

Which case is more important?

How to reduce the losses?

First, let's create functions that calculate the performance metrics and the confusion matrix, so that we don't have to repeat the same code for each model.
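One possible shape for such helpers is sketched below. The function names classification_metrics and confusion_counts are hypothetical, and the choice of accuracy, recall, precision and F1 is an assumption about which metrics are of interest. Both functions take predicted probabilities and a threshold, so the same code can be reused when the threshold is tuned later.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)


def classification_metrics(y_true, y_prob, threshold=0.5):
    """Return accuracy, recall, precision and F1 for probabilities cut at `threshold`."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return pd.DataFrame(
        {
            "Accuracy": [accuracy_score(y_true, y_pred)],
            "Recall": [recall_score(y_true, y_pred)],
            "Precision": [precision_score(y_true, y_pred)],
            "F1": [f1_score(y_true, y_pred)],
        }
    )


def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return the 2x2 confusion matrix as a labeled DataFrame."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return pd.DataFrame(
        confusion_matrix(y_true, y_pred, labels=[0, 1]),
        index=["Actual 0", "Actual 1"],
        columns=["Predicted 0", "Predicted 1"],
    )


# Toy example with made-up labels and probabilities.
y_true = [0, 0, 1, 1, 1]
y_prob = [0.1, 0.6, 0.4, 0.7, 0.9]
metrics = classification_metrics(y_true, y_prob)
cm = confusion_counts(y_true, y_prob)
print(metrics)
print(cm)
```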

7-2 Logistic Regression (with statsmodels library)

Checking model performance

Observations

7-3 Checking Multicollinearity

Observations:

  1. Dropping market_segment_type and room_type_reserved doesn't have a significant impact on the model performance.
  2. We can choose any model to proceed to the next steps.
  3. Here, we will go with the lg2 model.
  4. All categorical levels of each variable have VIF < 5, which indicates that multicollinearity is not a concern.

Note: The above process can also be done manually by picking the variable with the highest p-value, dropping it, and building the model again, one variable at a time. That is tedious, so using a loop is more efficient.

7-4 Converting coefficients to odds
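Logistic regression coefficients are on the log-odds scale, so exponentiating a coefficient gives the multiplicative change in the odds for a one-unit increase in that variable, and (exp(coef) - 1) * 100 gives the percentage change. A small sketch with illustrative coefficients follows. The names x1 and x2 and their values are made up to give round numbers and are not results from this analysis.

```python
import numpy as np
import pandas as pd

# Illustrative log-odds coefficients (hypothetical, chosen for round results).
coefs = pd.Series({"x1": np.log(2.0), "x2": np.log(0.5)})

odds = np.exp(coefs)               # multiplicative change in the odds
pct_change = (odds - 1.0) * 100.0  # percentage change in the odds

summary = pd.DataFrame({"Odds": odds, "Change_odd%": pct_change})
print(summary)
```

Here a one-unit increase in x1 doubles the odds (a 100% increase), while a one-unit increase in x2 halves them (a 50% decrease), holding the other variables constant.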

Coefficient interpretations

7-5 Model performance evaluation

ROC-AUC

7-6 Model Performance Improvement

Optimal threshold using AUC-ROC curve
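One common way to pick a threshold from the ROC curve is to maximize the difference between the true positive rate and the false positive rate (Youden's J statistic). The sketch below assumes that criterion. The function name roc_optimal_threshold and the toy labels and probabilities are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_curve


def roc_optimal_threshold(y_true, y_prob):
    """Return the threshold that maximizes TPR - FPR along the ROC curve."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    best = np.argmax(tpr - fpr)
    return thresholds[best]


# Toy data: one negative (0.75) is scored above two of the positives.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.75, 0.4, 0.7, 0.8, 0.9]

threshold = roc_optimal_threshold(y_true, y_prob)
print(threshold)
```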

Checking model performance on training set

Let's use the precision-recall curve to see whether we can find a better threshold
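A simple reading of the precision-recall curve is to take the threshold where precision and recall are closest to each other, which balances the two kinds of error. The sketch below assumes that criterion. The function name balanced_pr_threshold and the toy data are hypothetical.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def balanced_pr_threshold(y_true, y_prob):
    """Return the threshold where precision and recall are closest to each other."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    # precision and recall carry one extra trailing point with no threshold.
    gap = np.abs(precision[:-1] - recall[:-1])
    return thresholds[np.argmin(gap)]


# Toy data: one negative (0.75) is scored above two of the positives.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.75, 0.4, 0.7, 0.8, 0.9]

threshold = balanced_pr_threshold(y_true, y_prob)
print(threshold)
```

If one error type is costlier than the other, a threshold favoring recall or precision may be more appropriate than the crossing point.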

7-7 Final Model Summary

Model Performance Summary

Let's check the performance on the test set

Dropping the columns from the test set that were dropped from the training set

Using model with default threshold

ROC curve on test set

Using model with threshold = 0.317

Using model with threshold = 0.42

Model performance summary

Conclusion

8- Building a Decision Tree model

First, let's create functions that calculate the performance metrics and the confusion matrix, so that we don't have to repeat the same code for each model.

8-1 Build the model

We will build our model using DecisionTreeClassifier with the default 'gini' criterion to split. Other options include 'entropy'.
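A minimal sketch of fitting such a tree follows. It uses synthetic data from make_classification instead of the hotel data, and the split ratio, random_state values and the F1 metric are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the bookings.
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# criterion="gini" is the default; "entropy" is another option.
tree = DecisionTreeClassifier(criterion="gini", random_state=1)
tree.fit(X_train, y_train)

train_f1 = f1_score(y_train, tree.predict(X_train))
test_f1 = f1_score(y_test, tree.predict(X_test))
print(train_f1, test_f1, tree.get_depth())
```

An unconstrained tree typically fits the training set perfectly, so a gap between the training and test scores is an early sign of overfitting.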

Checking model performance on training set

Checking model performance on testing set

Visualizing the Decision Tree

The tree above is very complex, and such a tree often overfits.

8-2 Do we need to prune the tree?

8-2-1 Using GridSearch for Hyperparameter tuning of our tree model
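The sketch below shows how GridSearchCV can be used to pre-prune a decision tree. The parameter grid, the F1 scoring choice and the 5-fold cross-validation are assumptions made for illustration, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Hypothetical grid: limits on depth and leaf size restrain tree growth.
param_grid = {
    "max_depth": [2, 4, 6],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring="f1",
    cv=5,
)
search.fit(X_train, y_train)

best_tree = search.best_estimator_  # refit on the full training set
print(search.best_params_, best_tree.get_depth())
```

Because the search is cross-validated on the training data only, the test set stays untouched for the final performance check.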

Checking model performance on training set

Checking model performance on testing set

Visualizing the Decision Tree

8-2-2 Cost Complexity Pruning

F1 Score vs alpha for training and testing sets
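Cost complexity pruning can be sketched as follows: compute the candidate alphas with cost_complexity_pruning_path, fit one tree per alpha, and compare F1 on the training and testing sets. The data is synthetic, and choosing the alpha with the highest test F1 is an assumption about the selection rule. A separate validation set would avoid tuning on the test data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_train, y_train
)
# Drop the last alpha (it prunes the tree down to the root) and clip
# tiny negative values that can appear from floating-point error.
ccp_alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0, None))

train_f1, test_f1 = [], []
for alpha in ccp_alphas:
    tree = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    train_f1.append(f1_score(y_train, tree.predict(X_train)))
    test_f1.append(f1_score(y_test, tree.predict(X_test)))

best_alpha = ccp_alphas[int(np.argmax(test_f1))]
print(best_alpha)
```

Plotting train_f1 and test_f1 against ccp_alphas gives the F1 Score vs alpha curves: the training score falls as alpha grows, while the test score usually rises first and then declines.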

Checking model performance on training set

Checking model performance on testing set

8-3 Model Performance Comparison and Conclusions

9- Actionable Insights and Recommendations

Conclusion

Recommendation